.Dd December 5, 2023
.Dt LLAMAFILE-QUANTIZE 1
.Os Llamafile Manual
.Sh NAME
.Nm llamafile-quantize
.Nd large language model quantizer
.Sh SYNOPSIS
.Nm
.Op flags...
.Ar model-f32.gguf
.Op Ar model-quant.gguf
.Ar type
.Op Ar nthreads
.Sh DESCRIPTION
.Nm
converts large language model weights from the float32 or float16
formats into smaller data types from 2 to 8 bits in size.
.Sh OPTIONS
The following flags are available:
.Bl -tag -width indent
.It Fl Fl allow-requantize
Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
.It Fl Fl leave-output-tensor
Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
.It Fl Fl pure
Disable k-quant mixtures and quantize all tensors to the same type
.El
.Sh ARGUMENTS
The following positional arguments are accepted:
.Bl -tag -width indent
.It Ev Ar model-f32.gguf
Is the input file, which contains the unquantized model weights in either the float32 or float16 format.
.It Ev Ar model-quant.gguf
Is the output file, which will contain quantized weights in the desired format. If this path isn't specified, it'll default to [inp path]/ggml-model-[ftype].gguf.
.It Ev Ar type
Is the desired quantization format, which may be the integer id of a supported quantization type, or its name. See the quantization types section below for acceptable formats.
.It Ev Ar nthreads
Number of threads to use during computation (default: nproc/2)
.El
.Sh QUANTIZATION TYPES
The following quantization types are available. This table shows the ID
of the quantization format, its name, the file size of 7B model weights
that use it, and finally the amount of quality badness it introduces as
measured by the llamafile-perplexity tool averaged over 128 chunks with
the TinyLLaMA 1.1B v1.0 Chat model. Rows are ordered in accordance with
how recommended the quantization format is for general usage.
.Pp
.Bl -dash -compact
.It
  18 Q6_K   5.6gb +0.0446 ppl (q6 kawrakow)
.It
   7 Q8_0   7.2gb +0.0022 ppl (q8 gerganov)
.It
   1 F16    14gb  +0.0000 ppl (best but biggest)
.It
   8 Q5_0   4.7gb +0.0817 ppl (q5 gerganov zero)
.It
  17 Q5_K_M 4.8gb +0.0836 ppl (q5 kawrakow medium)
.It
  16 Q5_K_S 4.7gb +0.1049 ppl (q5 kawrakow small)
.It
  15 Q4_K_M 4.1gb +0.3132 ppl (q4 kawrakow medium)
.It
  14 Q4_K_S 3.9gb +0.3408 ppl (q4 kawrakow small)
.It
  13 Q3_K_L 3.6gb +0.5736 ppl (q3 kawrakow large)
.It
  12 Q3_K_M 3.3gb +0.7612 ppl (q3 kawrakow medium)
.It
  11 Q3_K_S 3.0gb +1.3834 ppl (q3 kawrakow small)
.It
  10 Q2_K   2.6gb +4.2359 ppl (tiniest hallucinates most)
.It
  32 BF16   14gb  +0.0000 ppl (canonical but cpu/cuda only)
.It
   0 F32    27gb   9.0952 ppl (reference point)
.It
   2 Q4_0   3.9gb +0.3339 ppl (legacy)
.It
   3 Q4_1   4.3gb +0.4163 ppl (legacy)
.It
   9 Q5_1   5.1gb +0.1091 ppl (legacy)
.It
  12 Q3_K   alias for Q3_K_M
.It
  15 Q4_K   alias for Q4_K_M
.It
  17 Q5_K   alias for Q5_K_M
.It
COPY Only copy tensors, no quantizing.
.El
.Sh SEE ALSO
.Xr llamafile 1 ,
.Xr llamafile-imatrix 1 ,
.Xr llamafile-perplexity 1 ,
.Xr llava-quantize 1 ,
.Xr zipalign 1 ,
.Xr unzip 1
